Space-time interaction: Different patterns by location AND time
Part 2: Understanding Panel Data
What is Panel Data?
Definition: Data that follows the same units over multiple time periods
Cross-sectional data:
Each row = one observation
House prices in 2024
One snapshot in time
Panel data:
Each row = unit × time period
Station × hour combinations
Repeated observations
Example: Bike Share Panel
```r
# Cross-sectional: One row per station
Station_A, May_2018, 4,250_total_trips
Station_B, May_2018, 2,100_total_trips

# Panel: One row per station-hour
Station_A, May_1_08:00, 12_trips
Station_A, May_1_09:00, 15_trips
Station_A, May_1_10:00, 8_trips
Station_B, May_1_08:00, 5_trips
Station_B, May_1_09:00, 7_trips
```
Key insight: Now we can see how demand changes WITHIN stations over time
Panel Data Structure
| Station ID | Date-Hour        | Trip Count | Temperature | Day of Week |
|-----------|------------------|------------|-------------|-------------|
| 1         | 2018-05-01 08:00 | 12         | 65°F        | Tuesday     |
| 1         | 2018-05-01 09:00 | 15         | 67°F        | Tuesday     |
| 1         | 2018-05-01 10:00 | 8          | 69°F        | Tuesday     |
| 2         | 2018-05-01 08:00 | 5          | 65°F        | Tuesday     |
| 2         | 2018-05-01 09:00 | 7          | 67°F        | Tuesday     |
Each row = station-hour observation with features and outcome
Why Panel Data for Bike Share?
Station-specific baselines:
Station A (downtown): High demand during work hours
Station B (residential): High demand mornings/evenings
Station C (tourist area): High demand weekends
Time-based patterns:
Rush hour peaks
Weekend vs. weekday differences
Weather effects
Holiday impacts
Panel structure lets us capture BOTH station differences AND time patterns
Part 3: Binning Data into Time Intervals
Why Bin the Data?
Raw trip data:
Trip 1: Started at 8:05:23 AM
Trip 2: Started at 8:07:41 AM
Trip 3: Started at 8:15:12 AM
Trip 4: Started at 8:23:08 AM
Problem: Every trip starts at a unique timestamp
Can’t aggregate or find patterns at the second level
Solution: Group trips into uniform time intervals (bins)
Binning in Practice
Hourly binning:
```r
# All trips between 8:00-8:59 AM → "08:00" bin
dat <- dat %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"))
```
Result:
Trip at 8:05 AM → 08:00 bin
Trip at 8:23 AM → 08:00 bin
Trip at 9:07 AM → 09:00 bin
Now we can count: “Station A had 15 trips in the 8:00 AM hour”
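Once trips are binned, that count is a simple group-and-summarize. A minimal sketch, assuming `dat` carries a `from_station_id` column alongside the `interval60` bin created above:

```r
library(dplyr)

# Count trips per station per hourly bin
hourly_counts <- dat %>%
  group_by(from_station_id, interval60) %>%
  summarize(Trip_Count = n(), .groups = "drop")
```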
Alternative: 15-Minute Bins
Finer temporal resolution:
```r
dat <- dat %>%
  mutate(interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"))
```
Trade-offs:
(+) More granular patterns (peak vs. off-peak within hour)
(+) Better for short-term forecasting
(-) More sparse data (some 15-min periods have zero trips)
(-) More complex models
Today: We’ll use hourly bins for simplicity
Extracting Time Features
```r
dat <- dat %>%
  mutate(
    interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
    week = week(interval60),                # Week of year (1-53)
    dotw = wday(interval60, label = TRUE),  # Day of week (Mon, Tue, ...)
    hour = hour(interval60)                 # Hour of day (0-23)
  )
```
These become predictors:
Rush hour indicator: hour %in% c(7,8,9, 17,18,19)
Weekend indicator: dotw %in% c("Sat", "Sun")
Holiday effects: Memorial Day weekend
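The first two indicators above translate directly into `mutate()` calls on the time columns just extracted (a sketch; the new column names are illustrative):

```r
library(dplyr)

# Binary indicator features built from hour and day-of-week
dat <- dat %>%
  mutate(
    rush_hour = hour %in% c(7, 8, 9, 17, 18, 19),  # AM + PM peaks
    weekend   = dotw %in% c("Sat", "Sun")
  )
```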
Part 4: Temporal Lags
What Are Temporal Lags?
Core idea: Past demand predicts future demand
Spatial features (Week 6):
Crimes within 500ft
Distance to downtown
Nearby amenities
Temporal features (This week):
Demand 1 hour ago
Demand 2 hours ago
Demand yesterday (24 hours ago)
Intuition: If there were 15 trips at 8 AM, there will probably be ~15 trips at 9 AM
lag1Hour (1 hour): Short-term persistence (demand in the previous hour)
lag1day (24 hours): Daily periodicity (same time yesterday)
Model will learn which lags are most predictive for each station/time combination
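In dplyr, these lags can be computed with `lag()` after sorting each station's rows by time. A minimal sketch, assuming a complete panel (no missing station-hours):

```r
library(dplyr)

# Lagged demand features, computed within each station
study.panel <- study.panel %>%
  arrange(from_station_id, interval60) %>%  # time order within station
  group_by(from_station_id) %>%
  mutate(
    lag1Hour  = lag(Trip_Count, 1),   # demand 1 hour ago
    lag2Hours = lag(Trip_Count, 2),   # demand 2 hours ago
    lag1day   = lag(Trip_Count, 24)   # same hour yesterday
  ) %>%
  ungroup()
```

The `group_by()` is essential: without it, the first hour of one station would "inherit" a lag from the last hour of the previous station.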
Connection to Week 6: Spatial Features
Remember Week 6?
```r
# Spatial features we created:
boston.sf <- boston.sf %>%
  mutate(
    crimes_500ft  = ...,  # Count crimes nearby
    crime_nn3     = ...,  # Average crime at 3 nearest neighbors
    dist_downtown = ...   # Distance to downtown
  )
```
This week is the same concept:
```r
# Temporal features we're creating:
study.panel <- study.panel %>%
  mutate(
    lag1Hour = ...,  # Demand at nearby TIME
    lag1day  = ...   # Demand at similar TIME
  )
```
Part 5: Creating the Space-Time Panel
The Challenge: Missing Observations
Problem: Not every station has trips every hour
```r
# Data we have (sparse):
Station_A, May_1_08:00, 12_trips ✓
Station_A, May_1_09:00, 0_trips  ✗ (missing row!)
Station_A, May_1_10:00, 8_trips  ✓
```
But we NEED:
```r
# Complete panel (every station-hour combination):
Station_A, May_1_08:00, 12_trips
Station_A, May_1_09:00, 0_trips  ← Must exist with 0!
Station_A, May_1_10:00, 8_trips
```
Why? Lag calculations break if rows are missing
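A toy example makes the failure concrete: if the 09:00 row is absent, `lag(Trip_Count, 1)` at 10:00 silently returns the 08:00 count instead of the true previous-hour demand of zero.

```r
library(dplyr)

# Sparse data: the 09:00 row (zero trips) is missing entirely
sparse <- tibble::tibble(
  hour       = c(8, 10),
  Trip_Count = c(12, 8)
)

sparse %>% mutate(lag1Hour = lag(Trip_Count, 1))
# At hour 10, lag1Hour is 12 (the 08:00 count) — off by one time step,
# because lag() only sees rows, not clock time
```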
Creating a Complete Panel
Step 1: Calculate all possible combinations
```r
# How many unique stations?
length(unique(dat_census$from_station_id))  # e.g., 600 stations

# How many unique hours?
length(unique(dat_census$interval60))       # e.g., 744 hours (31 days)

# Total combinations needed:
# 600 stations × 744 hours = 446,400 rows
```
Reality check: Do we have 446,400 rows? Probably not!
expand.grid() to the Rescue
```r
# Create every possible station-hour combination
study.panel <- expand.grid(
  interval60      = unique(dat_census$interval60),
  from_station_id = unique(dat_census$from_station_id)
)

# Join to actual trip counts
study.panel <- study.panel %>%
  left_join(
    dat_census %>%
      group_by(interval60, from_station_id) %>%
      summarize(Trip_Count = n()),
    by = c("interval60", "from_station_id")
  ) %>%
  mutate(Trip_Count = replace_na(Trip_Count, 0))  # Fill missing with 0
```
Now every station-hour exists, even if Trip_Count = 0
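A quick sanity check confirms the panel is truly complete (a sketch using the same objects as above):

```r
# The panel should have exactly stations × hours rows
n_stations <- length(unique(dat_census$from_station_id))
n_hours    <- length(unique(dat_census$interval60))
stopifnot(nrow(study.panel) == n_stations * n_hours)
```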
Joining Station Attributes
Each station has fixed characteristics:
```r
# Station location, demographics from census
station_data <- dat_census %>%
  group_by(from_station_id) %>%
  summarize(
    from_latitude  = first(from_latitude),
    from_longitude = first(from_longitude),
    Med_Inc        = first(Med_Inc),
    Percent_White  = first(Percent_White)
    # ... other demographics
  )

# Join to panel
study.panel <- study.panel %>%
  left_join(station_data, by = "from_station_id")
```
Result: Every row has station location + demographics
Key difference: We use PAST outcomes as features, not NEIGHBOR outcomes
The Temporal Validation Problem
You CANNOT train on the future to predict the past!
WRONG approach:
```r
# Train on Weeks 3-4
train <- data %>% filter(week >= 19)

# Test on Weeks 1-2
test <- data %>% filter(week < 19)
```
This is predicting the past using the future!
CORRECT approach:
```r
# Train on Weeks 1-2
train <- data %>% filter(week < 19)

# Test on Weeks 3-4
test <- data %>% filter(week >= 19)
```
This is predicting the future using the past!
Why This Matters
Real-world scenario:
It’s May 15, 2018. You need to forecast demand for May 16-31.
You have data from: May 1-15 ✓
You don’t have data from: May 16-31 (it hasn’t happened yet!)
You must train on May 1-15 and test on May 16-31
Temporal Train/Test Split
```r
# Split by time (week of year)
train <- study.panel %>% filter(week < 19)   # Weeks 1-2 of May (early period)
test  <- study.panel %>% filter(week >= 19)  # Weeks 3-4 of May (later period)

# Fit models on training data only
model <- lm(Trip_Count ~ lag1Hour + lag1day + Temperature + weekend,
            data = train)

# Evaluate on test data
predictions <- predict(model, newdata = test)
```
This mirrors operational deployment: Predict tomorrow using yesterday’s patterns
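Accuracy on the held-out weeks can then be summarized with a simple error metric. A sketch, assuming the `model` and `test` objects from the split above:

```r
library(dplyr)

# Mean Absolute Error (MAE) on the held-out future weeks
test <- test %>%
  mutate(
    Prediction = predict(model, newdata = test),
    Abs_Error  = abs(Prediction - Trip_Count)
  )

mean(test$Abs_Error, na.rm = TRUE)
```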
Comparison to Spatial CV
Week 7: Spatial cross-validation
Prevented spatial leakage
Left out entire neighborhoods for testing
Ensured model generalizes to new areas
This week: Temporal validation
Prevents temporal leakage
Holds out future time periods for testing
Ensures model generalizes to future
Both are about out-of-sample generalization!
Part 7: Building Models
Model Progression Strategy
We’ll build 5 models, adding complexity:
Baseline: Time + Weather only
+ Temporal lags: Add lag1Hour, lag1day
+ Spatial features: Add demographics, location
+ Station fixed effects: Control for station-specific baselines
+ Holiday effects: Account for Memorial Day weekend
Goal: See which features improve prediction accuracy
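The progression might be sketched as a sequence of nested `lm()` calls, each adding one feature group; variable names like `Temperature`, `weekend`, and `holiday` are assumptions about the panel's columns, not confirmed by the lecture:

```r
# 1. Baseline: time + weather only
mod1 <- lm(Trip_Count ~ hour + dotw + Temperature, data = train)

# 2. + temporal lags
mod2 <- update(mod1, . ~ . + lag1Hour + lag1day)

# 3. + spatial features (demographics)
mod3 <- update(mod2, . ~ . + Med_Inc + Percent_White)

# 4. + station fixed effects (a dummy per station)
mod4 <- update(mod3, . ~ . + from_station_id)

# 5. + holiday indicator (e.g., Memorial Day weekend)
mod5 <- update(mod4, . ~ . + holiday)
```

Comparing test-set error across `mod1`–`mod5` shows which feature group buys the most accuracy.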